Abstract

Non-adaptive geostatistical designs (NAGD) offer standard ways of collecting and analysing
geostatistical data in which sampling locations are fixed in advance of any data collection.
In contrast, adaptive geostatistical designs (AGD) allow the collection of exposure and outcome
data over time to depend on information obtained from previously collected data, so as to
optimise data collection towards the analysis objective. AGDs are becoming more important in
spatial mapping, particularly in poor resource settings where uniformly precise mapping may be
unrealistically costly and the priority is often to identify critical areas where interventions
can have the most health impact. Two constructions are singleton and batch adaptive sampling.
In singleton sampling, locations $x_i$ are chosen sequentially, and at each stage $x_{k+1}$ depends
on data obtained at locations $x_1, \ldots, x_k$. In batch sampling, locations are chosen in batches
of size $b > 1$, allowing each new batch, $\{x_{(k+1)}, \ldots, x_{(k+b)}\}$, to depend on data obtained
at locations $x_1, \ldots, x_{kb}$. In most settings, batch sampling is more realistic than singleton
sampling. We propose specific batch AGDs and assess their efficiency relative to their singleton
adaptive and non-adaptive counterparts by using simulations. We show how we apply these findings
to inform an AGD of a rolling Malaria Indicator Survey, part of a large-scale, five-year malaria
transmission reduction project in Malawi.

Keywords: Adaptive sampling strategies, Spatial statistics, Geostatistics, Malaria, Prevalence
mapping
1 Introduction


Geostatistics has its origins in the South African mining industry (Krige, 1951), and was
subsequently developed by Georges Matheron and colleagues into a self-contained methodology
for solving prediction problems arising principally in mineral exploration; Chilès and Delfiner
(2012) is a recent book-length account. Within the general statistics research community, the
term geostatistics more generally refers to the branch of spatial statistics that is concerned
with investigating an unobserved spatial phenomenon $S = \{S(x) : x \in D \subset \mathbb{R}^2\}$, where $D$ is a
geographical region of interest, using data in the form of measurements $y_i$ at locations $x_i \in D$.
Typically, each $y_i$ can be regarded as a noisy version of $S(x_i)$. We write $X = \{x_1, \ldots, x_n\}$ and
call $X$ the sampling design.


Geostatistical analysis can address either or both of two broad objectives: estimation of the
parameters that define a stochastic model for the unobserved process $S$ and the observed
data $\{(y_i, x_i) : i = 1, \ldots, n\}$; and prediction of the unobserved realisation of $S(x)$ throughout $D$,
or particular characteristics of this realisation, for example its average value.

A key consideration for geostatistical design is that sampling designs that are efficient for 
parameter estimation are generally inefficient for prediction, and vice versa. Since parameter 
values are always unknown in practice, design for prediction therefore involves a compromise. 
Furthermore, the diversity of potential predictive targets requires design strategies to be 
context-specific. Another important distinction is between non-adaptive sampling designs that 
must be completely specified prior to data-collection, and adaptive designs, for which data are 
collected over a period of time and later sampling locations can depend on data collected from 
earlier locations. 


In this paper we formulate, and evaluate through simulation studies, a class of adaptive design
strategies that address two compromises: between efficient parameter estimation and efficient
prediction; and between theoretical advantages and practical constraints. The motivation
for our work is the mapping of malaria prevalence in rural communities through a series of
“rolling malaria indicator surveys,” henceforth rMIS (Roca-Feltrer, Lalloo, Phiri, and Terlouw,
2012). Malaria prevalence is highly heterogeneous in time and space. Adaptive design is
especially relevant here because resource constraints make it difficult to achieve uniformly
precise predictions throughout the region of interest. Hence, as data accrue over the study-region
$D$ it becomes appropriate to focus progressively on sub-regions of $D$ where precise
prediction is needed to inform public health action, for example to prioritise sub-regions for
early intervention.


In Section 2 we review the existing literature on adaptive geostatistical design and set out the
methodological framework within which we will specify and evaluate adaptive design strategies.
Section 3 describes our proposed class of adaptive designs for efficient prediction. Section 4
gives the results of a simulation study in which we compare the predictive efficiency of our
proposed design strategy with simpler, non-adaptive strategies. Section 5 is an application
to the design of an ongoing prevalence mapping exercise around the perimeter of the Majete
wildlife reserve, Chikwawa District, Southern Malawi through an rMIS that will be conducted
monthly over a two-year period. Section 6 is a concluding discussion.
2 Methodological framework


2.1 Geostatistical models for prevalence data 


The standard geostatistical model for prevalence data can be formulated as follows (Diggle,
Tawn, and Moyeed, 1998). For $i = 1, \ldots, n$, let $Y_i$ be the number of positive outcomes out
of $n_i$ individuals tested at location $x_i$ in a region of interest $D \subset \mathbb{R}^2$, and $d(x_i) \in \mathbb{R}^p$ a vector
of associated covariates. The model assumes that $Y_i \sim \text{Binomial}(n_i, p(x_i))$ where $p(x)$ is the
prevalence of disease at a location $x$. The model further assumes that

$$\log[p(x_i)/\{1 - p(x_i)\}] = d(x_i)'\beta + S(x_i) \tag{1}$$

where $S(x)$ is a stationary Gaussian process with zero mean, variance $\sigma^2$ and correlation
function $\rho(u) = \text{Corr}\{S(x), S(x')\}$, where $u$ is the distance between $x$ and $x'$.

Fitting the standard model involves computationally intensive Monte Carlo methods, but
software implementations are available; we use the R package PrevMap (Giorgi and Diggle,
2015). Stanton and Diggle (2013) show that provided the $n_i$ are at least 100 and $|p(x) - 0.5|$ is
at most 0.4, reliable predictions can be obtained using the following computationally simpler
approach. Define the empirical logit transform,

$$Y_i^* = \log\{(Y_i + 0.5)/(n_i - Y_i + 0.5)\}$$

and assume that

$$Y_i^* = d(x_i)'\beta + S(x_i) + Z_i, \tag{2}$$

where, as in (1), $S(x)$ is a stationary Gaussian process with variance $\sigma^2$ and correlation function
$\rho(u)$, and the $Z_i$ are mutually independent zero-mean Gaussian random variables with variance
$\tau^2$. Using this approximate method, predictive inferences need to be back-transformed from
the logit to the prevalence scale.
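
The transform and its inverse can be sketched in a few lines of Python. This is an illustration only: the function names and the counts (23 positives out of 120 tested) are hypothetical and are not taken from the study data.

```python
import math

def empirical_logit(y, n):
    """Empirical logit of y positive outcomes out of n individuals tested."""
    return math.log((y + 0.5) / (n - y + 0.5))

def inverse_logit(eta):
    """Map a value on the logit scale back to the prevalence scale."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical counts, for illustration only.
y_star = empirical_logit(23, 120)
prevalence = inverse_logit(y_star)   # equals (23 + 0.5) / (120 + 1)
```

Because the inverse logit is monotone, quantiles of a predictive distribution on the logit scale map directly to quantiles on the prevalence scale; the same is not true of the predictive mean.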


In what follows, we will assume a Matérn (1960) correlation structure for $S(x)$,

$$\rho(u; \phi, \kappa) = \{2^{\kappa-1}\Gamma(\kappa)\}^{-1}(u/\phi)^{\kappa}K_{\kappa}(u/\phi), \tag{3}$$

where $\phi > 0$ is a scale parameter that controls the rate at which correlation decays with
increasing distance, $K_{\kappa}(\cdot)$ is a modified Bessel function of order $\kappa > 0$, and $S(x)$ is $m$ times
mean-square differentiable if $\kappa > m$. In the simulation studies reported in Section 4 we use the
computationally simpler, approximate method to compare different designs and do not include
covariates. For the analyses of the Majete data reported in Section 5 we use the standard
model (1).
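
For half-integer values of the shape parameter the Bessel function reduces to an exponential times a polynomial, which gives a convenient check on any implementation. The sketch below assumes the common parametrisation $\rho(u; \phi, \kappa) = \{2^{\kappa-1}\Gamma(\kappa)\}^{-1}(u/\phi)^{\kappa}K_{\kappa}(u/\phi)$ and covers only $\kappa = 0.5$, 1.5 and 2.5; the function name is ours.

```python
import math

def matern(u, phi, kappa):
    """Matern correlation at distance u, closed form for kappa in {0.5, 1.5, 2.5}."""
    if u < 0 or phi <= 0:
        raise ValueError("need u >= 0 and phi > 0")
    z = u / phi
    if kappa == 0.5:
        poly = 1.0                        # exponential correlation
    elif kappa == 1.5:
        poly = 1.0 + z
    elif kappa == 2.5:
        poly = 1.0 + z + z * z / 3.0
    else:
        raise ValueError("closed form implemented only for kappa = 0.5, 1.5, 2.5")
    return poly * math.exp(-z)

# Correlation at one scale-length for the values used in the simulation study.
rho_at_phi = matern(0.05, phi=0.05, kappa=1.5)   # 2 * exp(-1)
```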
2.2 Likelihood-based inference under adaptive design


Almost all geostatistical analyses are conducted under the assumption that the sampling design,
$X$, is stochastically independent of $S$. This justifies basing inference on the likelihood function
corresponding to the conditional distribution of $Y$ given $X$, which typically gives information
on all quantities of interest. Diggle, Menezes, and Su (2010) discuss the inferential challenges
that result when the independence assumption does not hold, in which case the data $(X, Y)$
should strictly be considered jointly as a realisation of a marked point process. Diggle, Menezes,
and Su (2010) call this preferential sampling; see also Pati, Reich, and Dunson (2011), Gelfand,
Sahu, and Holland (2012), Shaddick and Zidek (2014), and Zidek, Shaddick and Taylor (2014).

In adaptive design, $X$ and $S$ are not independent but are conditionally independent given
$Y$, which simplifies the form of the likelihood function. To see why, let $X_0$ denote an initial
sampling design chosen independently of $S$, and $Y_0$ the resulting measurement data. Similarly
denote by $X_1$ the set of additional sampling locations added as a result of analysing the initial
data-set $(X_0, Y_0)$, $Y_1$ the resulting additional measurement data, and so on. After $k$ additions,
the complete data-set consists of $X = X_0 \cup X_1 \cup \ldots \cup X_k$ and $Y = (Y_0, Y_1, \ldots, Y_k)$. Using the
notation $[\cdot]$ to mean “the distribution of”, the associated likelihood for the complete data-set is

$$[X, Y] = \int_S [X, Y, S]\, dS. \tag{4}$$

We consider first the case $k = 1$. The standard factorisation of any multivariate distribution
gives

$$[X, Y, S] = [S, X_0, Y_0, X_1, Y_1] = [S][X_0|S][Y_0|X_0, S][X_1|Y_0, X_0, S][Y_1|X_1, Y_0, X_0, S]. \tag{5}$$

On the right-hand side of (5), note that by construction, $[X_0|S] = [X_0]$ and $[X_1|Y_0, X_0, S] =
[X_1|Y_0, X_0]$. It then follows from (4) and (5) that

$$[X, Y] = [X_0][X_1|X_0, Y_0] \times \int_S [Y_0|X_0, S][Y_1|X_1, Y_0, X_0, S][S]\, dS = [X|Y_0] \times [Y|X]. \tag{6}$$

This shows that the conditional likelihood, $[Y|X]$, can legitimately be used for inference
although, depending on how $[X|Y_0]$ is specified, it may be inefficient. The argument leading to
(6) extends to $k > 1$ with essentially only notational changes.
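
For general $k$ the same two steps can be written out as below. This is our own expansion of the argument rather than a formula from the original text, and the shorthand $X_{<j} = (X_0, \ldots, X_{j-1})$, $X_{\le j} = (X_0, \ldots, X_j)$ and $Y_{<j} = (Y_0, \ldots, Y_{j-1})$ is introduced here only for compactness.

```latex
% Factorisation, using [X_0 | S] = [X_0] and the fact that each X_j is
% chosen from earlier designs and data only, not from S directly.
[X, Y, S] = [S]\,[X_0]\,[Y_0 \mid X_0, S]
            \prod_{j=1}^{k} [X_j \mid X_{<j}, Y_{<j}]\,
                            [Y_j \mid X_{\le j}, Y_{<j}, S]

% Integrating over S: the design terms do not involve S and factor out.
[X, Y] = [X_0] \prod_{j=1}^{k} [X_j \mid X_{<j}, Y_{<j}]
         \times \int_S [S]\,[Y_0 \mid X_0, S]
                \prod_{j=1}^{k} [Y_j \mid X_{\le j}, Y_{<j}, S]\, dS
```

The first factor plays the role of the design term in the $k = 1$ case, and the integral is the term written there as the conditional distribution of $Y$ given $X$.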
3 An adaptive design strategy

3.1 Performance criteria 


In practice, each geostatistical prediction exercise will have its own, context-specific primary
objective. To provide a framework for a general discussion, let $S = \{S(x) : x \in D\}$ denote
the realisation of the process $S(x)$ over $D$. Also, let $X = \{x_1, \ldots, x_n\}$ denote the sampling
design and $Y = (Y_1, \ldots, Y_n)$ the corresponding measurement data. Let $T = T(S)$, called the
predictive target, represent the property of $S$ that is of primary interest. A generic measure
of the predictive accuracy of a design $X$ is its mean square error,
$MSE(X) = E[(T - \hat{T})^2]$, where $\hat{T} = E[T|Y; X]$ is the minimum mean square error predictor
of $T$ for any given design $X$. Note that in the expression for $MSE(X)$ the expectation is with
respect to both $S$ and $Y$, whereas in the expression for $\hat{T}$ it is with respect to $S$ holding $Y$
fixed at its observed value.


One obvious predictive target is $T(x) = S(x)$ for an arbitrary location $x \in D$. Another, which
may be more relevant when the practical goal is to decide whether or not to launch a public
health intervention, is a complete map $T(x) = I(S(x) > c)$, where $I(\cdot)$ is the indicator function
and $c$ is a policy-relevant threshold; see, for example, Figure 3 of Zouré, Noma, Tekle, Amazigo,
Diggle, Giorgi, and Remme (2014). Spatially neutral versions of these targets can be defined
by integration over $D$, hence

$$IMSE(X) = \int_D E[\{T(x) - \hat{T}(x)\}^2]\, dx.$$

We emphasise that in any particular application, other measures of performance may be more
appropriate. However, for a comparative evaluation of different general design strategies, we
adopt $IMSE(X)$ as a sensible generic measure.


3.2 Some non-adaptive geostatistical designs 


Two standard non-adaptive designs are a completely random design, in which the sample
locations $x_i$ form an independent random sample from the uniform distribution on $D$, and a
completely regular design in which the $x_i$ form a regular square or, less commonly, triangular
lattice. Geostatistical design problems can be classified according to whether the primary
objective is parameter estimation or spatial prediction and, in the latter case, whether model
parameters are assumed known or unknown. Our focus is on design for efficient prediction
when model parameters are unknown, this being the ultimate goal of most geostatistical
analyses. Completely regular designs typically give efficient prediction when the target is the
spatial average of $S(x)$, i.e. $T = \int_D S(x)\, dx$, and model parameters are known; see, for example,
Matérn (1960, Chapter 5). When parameters are unknown, less regular designs have been
shown to be preferable in particular settings; see, for example, Diggle and Lophaven (2006),
although a general theory of optimal geostatistical design is lacking.
Most of the previous research on design considerations for prediction assumes a known covariance
structure for the data; see, for example, Benhenni and Cambanis (1992), Müller (2005) and
Ritter (1996). Su and Cambanis (1993) address the problem of estimating parameters from a
random process with a finite number of observations, and measure the design performance by
integrated mean square error. They show that random designs are asymptotically optimal.
McBratney, Webster, and Burgess (1981) address the problem of choosing the spacing of a
regular rectangular or triangular lattice design to achieve an acceptable value of the maximum
of the prediction variance over the region of interest. Yfantis, Flatman, and Behar (1987)
compare three regular sampling designs, namely the square, equilateral triangle and regular
hexagonal lattices. They conclude that the hexagonal design is the best when the nugget effect
is large and the sampling density is sparse.


Royle and Nychka (1998) and Nychka and Saltzman (1998) use a geometrical approach that
does not depend on the covariance structure of the underlying process $S(x)$. In this approach,
sample points are located in a way that minimises a criterion that is a function of the distances
between sampled and non-sampled locations. Royle and Nychka (1998) show that the resulting
space-filling designs generally perform well.


In contrast to the spatial designs for efficient prediction reviewed above, Zhu and Stein (2005)
consider designing for efficient covariance structure estimation. They assume the Gaussian
model (2) without covariates. Their design criterion is

$$V_0(S; \theta) = -\log\det \mathcal{I}(\theta),$$

where $\mathcal{I}(\theta)$ is the information matrix of the covariance parameters. This is equivalent to
D-optimality in the context of a linear model with uncorrelated measurement errors. Russo
(1984), Müller and Zimmerman (1999), and Bogaert and Russo (1999) consider variogram-based,
rather than likelihood-based, parameter estimation. The variogram of $S(x)$ is the
function $\gamma(u) = \frac{1}{2}\text{Var}\{S(x) - S(x')\}$ where $u$ is the distance between $x$ and $x'$. Müller and
Zimmerman (1999) regard a design as optimal if it minimises a suitable measure of the “size”
of the covariance matrix of the resulting parameter estimates.

More often than not, it is desirable to have designs that compromise between the two analysis
objectives of parameter estimation and spatial prediction. Usually, the same dataset is used for
covariance structure estimation and prediction of $S(x)$ at unsampled locations. Zhu and Stein
(2006) address the problem of spatial sampling design for prediction of stationary isotropic
Gaussian processes with estimated parameters of the covariance structure. They employ a
two-step algorithm that uses an initial set of locations $X_0$ to find the best design for prediction
with known covariance parameters and then, conditional on $X_0$, uses the rest to find the
best design for estimation of those covariance parameters. Pilz and Spöck (2006) address
a similar design problem but using a model-based approach in choosing an optimal design
for spatial prediction in the presence of uncertainty in the covariance structure. Using a
Bayesian approach, Diggle and Lophaven (2006) consider designs that are efficient for spatial
prediction when parameters are unknown. They looked at two different design scenarios,
namely: retrospective design, where they use as performance criterion the average prediction
variance (APV),

$$APV = \int_D \text{Var}\{S(x)|Y\}\, dx, \tag{7}$$
and prospective design, with performance criterion the expectation of APV, with respect to the
process $S(x)$. They concluded that in either situation, inclusion of close pairs in an otherwise
regular lattice design is generally a good choice.

3.3 A class of adaptive designs 

Our proposed approach to adaptive geostatistical design is as follows. 

1. Specify the finite set, $X^*$ say, of $n^*$ potential sampling locations $x_i \in D$. In our motivating
application, this consists of the locations of all households in their respective villages in
the Majete perimeter area. In other applications, any point $x \in D$ may be a potential
sampling location, in which case we take $X^*$ to be a finely spaced regular lattice to cover $D$.

2. Use a non-adaptive design to choose an initial set of sample locations, $X_0 = \{x_i \in D :
i = 1, \ldots, n_0\}$.

3. Use the corresponding data $Y_0$ to estimate the parameters of an assumed geostatistical
model.

4. Specify a criterion for the addition of one or more new sample locations to form an
enlarged set $X_0 \cup X_1$. A simple example would be for $X_1$ to be the elements of $X^*$ with
the largest values of the prediction variance amongst all points not already included in
$X_0$.

5. Repeat steps 3 and 4 with augmented data $Y_1$ at the points in $X_1$.

6. Stop when the required number of points has been sampled, a required performance
criterion has been achieved or no more potential sampling points are available.

Within this general approach, in addition to choosing a suitable addition criterion in step 4, we
need to choose the number and locations of points in the initial design, $X_0$, and the number to
be added at each subsequent stage, called the batch size. A batch size $b = 1$ must be optimal
theoretically, but is often infeasible in practice. For example, in our application to prevalence
mapping in the Majete wildlife reserve perimeter area, the associated sampling involves field
work in challenging terrain and remote villages to obtain the measurements $Y$. Restricting
each field-trip to collection of a single measurement would be a hopelessly inefficient use of
limited resources.

3.4 Types of adaptive designs 

We develop two main types of adaptive geostatistical designs, namely singleton and batch
adaptive designs.

In singleton adaptive sampling, $b = 1$, i.e. locations are chosen sequentially, allowing $x_{k+1}$ to
depend on data obtained at all earlier locations $x_1, \ldots, x_k$. One possible addition criterion
is to choose $x_{k+1}$ to be the location $x$ with the largest prediction
variance of $S(x)$ given the data from $x_1, \ldots, x_k$. This is an example of a deterministic rule for
identifying and adding new sample locations.

In batch adaptive sampling, $b > 1$. A naive extension of the above addition criterion, choosing
$\{x_{k+1}, \ldots, x_{k+b}\}$ to be the $b$ available locations with the largest prediction variance of $S(x)$, is
likely to fail because it does not penalise sampling from multiple locations $x$ at which the
corresponding $S(x)$ are highly correlated.

3.5 Algorithm for adaptive geostatistical design 

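For a predictive target $T(x) = S(x)$, given an initial set of sampling locations $X_0 = \{x_1, \ldots, x_{n_0}\}$,
the available set of additional sampling locations is $A_0 = X^* \setminus X_0$. For each $x \in A_0$, denote by
$PV(x)$ the prediction variance, $\text{Var}(T|Y_0)$. For the Gaussian model (2),

$$PV(x) = \sigma^2(1 - \sigma^2 r'V^{-1}r),$$

where $r = (r_1, \ldots, r_{n_0})'$ with $r_i = \rho(\|x - x_i\|)$, $V = \sigma^2 R + \tau^2 I$, $R$ is the $n_0$ by $n_0$ matrix with
elements $R_{ij} = \rho(\|x_i - x_j\|)$ and $I$ is the identity matrix. The factor $\sigma^2$ inside the bracket
arises because, with $V$ on the covariance scale, the covariance between $S(x)$ and $Y_0$ is $\sigma^2 r$.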
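
A minimal sketch of this calculation follows, using only the Python standard library. It assumes known parameters, a Matérn correlation with $\kappa = 1.5$ in its closed form $(1 + u/\phi)\exp(-u/\phi)$, and $V = \sigma^2 R + \tau^2 I$, so that the covariance between $S(x)$ and the data is $\sigma^2 r$ and the prediction variance is $\sigma^2 - \sigma^4 r'V^{-1}r$. The function names and the two-point design are hypothetical.

```python
import math

def matern15(u, phi):
    """Matern correlation with kappa = 1.5 (closed form)."""
    z = u / phi
    return (1.0 + z) * math.exp(-z)

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def prediction_variance(x, design, sigma2, tau2, phi):
    """Var{S(x) | Y} = sigma2 - c' V^{-1} c, where c = sigma2 * r."""
    n0 = len(design)
    c = [sigma2 * matern15(math.dist(x, xi), phi) for xi in design]
    v = [[sigma2 * matern15(math.dist(design[i], design[j]), phi)
          + (tau2 if i == j else 0.0) for j in range(n0)] for i in range(n0)]
    w = solve(v, c)
    return sigma2 - sum(ci * wi for ci, wi in zip(c, w))

# Hypothetical two-point design on the unit square.
design = [(0.2, 0.2), (0.8, 0.8)]
pv_at_data = prediction_variance((0.2, 0.2), design, sigma2=1.0, tau2=0.0, phi=0.05)
pv_far = prediction_variance((0.5, 0.5), design, sigma2=1.0, tau2=0.0, phi=0.05)
```

With no measurement error the prediction variance vanishes at a sampled location and approaches $\sigma^2$ far from all sampled locations, which is a quick sanity check on the scaling. In practice the parameters would be plug-in estimates from the current data.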

We propose to incorporate a minimum distance addition criterion, whereby we choose new
locations $x_{n_0+1}, x_{n_0+2}, \ldots, x_{n_0+b}$ with the $b$ largest values of $PV(x)$ subject to the constraint
that no two locations are separated by a distance of less than $\delta$.

For a formal specification, we use the following notation:

■ $X^*$ is the set of all potential sampling locations, with number of elements $n^*$;

■ $X_0$ is the initial sample, with number of elements $n_0$;

■ $b$ is the batch size;

■ $n = n_0 + kb$ is the total sample size;

■ $X_j$, $j \ge 1$, is the set of locations added in the $j$th batch, with number of elements $b$;

■ $A_j = X^* \setminus (X_0 \cup \ldots \cup X_j)$ is the set of available locations after addition of the $j$th batch.

The algorithm then proceeds as follows. 

1. Use a non-adaptive design to determine $X_0$.

2. Set $j = 0$.

3. For each $x \in A_j$, calculate $PV(x)$:

(i) choose $x^* = \arg\max_{x \in A_j} PV(x)$,

(ii) if $\|x^* - x_i\| > \delta$ for all $i = 1, \ldots, n_0 + jb$, add $x^*$ to the design,

(iii) otherwise, remove $x^*$ from $A_j$.

4. Repeat step 3 until $b$ locations have been added to form the set $X_{j+1}$.

5. Set $A_{j+1} = A_j \setminus X_{j+1}$ and update $j$ to $j + 1$.

6. Repeat steps 3 to 5 until the total number of sampled locations is $n$ or $A_j = \emptyset$.
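
The steps above can be sketched as follows. This is one reading of the algorithm, not the authors' implementation: within a batch the prediction variance is computed once from the current design, candidates are visited in decreasing order of that variance, and each candidate is checked against every location already in the design, including those added earlier in the same batch. The argument `pv_given`, which returns a prediction-variance function for a given design, is a hypothetical interface; the example uses distance to the nearest sampled point as a stand-in for a model-based prediction variance.

```python
import math

def select_batch(available, design, pv, b, delta):
    """Choose up to b locations with the largest pv(x), each more than delta
    from every location in the design or already in the current batch."""
    ranked = sorted(available, key=pv, reverse=True)        # step 3(i), repeated
    batch, remaining = [], []
    for idx, x in enumerate(ranked):
        if len(batch) == b:
            remaining.extend(ranked[idx:])                   # not yet examined
            break
        if all(math.dist(x, xi) > delta for xi in design + batch):
            batch.append(x)                                  # step 3(ii)
        # otherwise x is dropped from the available set, step 3(iii)
    return batch, remaining

def adaptive_design(candidates, initial, pv_given, n, b, delta):
    """Steps 2 to 6: add batches until n locations or none are available."""
    design = list(initial)
    available = [x for x in candidates if x not in design]
    while len(design) < n and available:
        pv = pv_given(design)                                # PV given current design
        batch, available = select_batch(
            available, design, pv, min(b, n - len(design)), delta)
        if not batch:                                        # nothing admissible left
            break
        design.extend(batch)
    return design

# Illustration on a 10 by 10 grid with a surrogate for the prediction variance.
grid = [(i / 9, j / 9) for i in range(10) for j in range(10)]

def nearest_distance(design):
    return lambda x: min(math.dist(x, s) for s in design)

result = adaptive_design(grid, [(0.0, 0.0)], nearest_distance, n=6, b=2, delta=0.15)
```

Dropping rejected candidates permanently is safe because a location within $\delta$ of a design point remains inadmissible in every later batch.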


4 Simulation study 


We conducted a simulation study of adaptive geostatistical design (henceforth AGD) so as
to compare its performance with standard examples of non-adaptive geostatistical designs
(NAGD). Sampling in non-adaptive designs is based on a priori information and is fixed before
the study is implemented (Thompson and Collins, 2002). Two examples of NAGD are random
and inhibitory designs. Inhibitory designs use a constrained form of simple random sampling
(Diggle, 2013) whereby the distance between any two sampled locations is required to be at
least $\delta$. In this way, we retain the objective of a randomised design whilst guaranteeing a
relatively even spatial coverage of the study region.
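
One simple way to generate such a design from a finite candidate set is sequential inhibition: visit the candidates in random order and keep a point only if it is at least $\delta$ from every point kept so far. The sketch below is an assumption about how an inhibitory design could be drawn, not a description of the software used in the paper; the function name, the seed argument and the use of grid-cell centres are ours.

```python
import math
import random

def inhibitory_design(candidates, n, delta, seed=0):
    """Visit candidates in random order, keeping a point if it is at least
    delta from every point kept so far; stop once n points are kept."""
    rng = random.Random(seed)
    order = list(candidates)
    rng.shuffle(order)
    sample = []
    for x in order:
        if all(math.dist(x, s) >= delta for s in sample):
            sample.append(x)
            if len(sample) == n:
                return sample
    raise ValueError("could not place n points at spacing delta")

# 64 by 64 grid of cell centres on the unit square (cell centres are assumed).
grid = [((i + 0.5) / 64, (j + 0.5) / 64) for i in range(64) for j in range(64)]
initial = inhibitory_design(grid, n=30, delta=0.03)
```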

In each case, data were generated as a realisation of a Gaussian process $S(x)$ on a 64 by 64 grid
covering the unit square, giving a total of $n^* = 4096$ potential sampling locations. We specified
$S(x)$ to have expectation $\mu = 0$, variance $\sigma^2 = 1$ and Matérn correlation function (3), with
$\phi = 0.05$ and $\kappa = 1.5$, and no measurement error, i.e. $\tau^2 = 0$. In each run of the simulation,
we used the adaptive design algorithm outlined in Section 3.5 to sample a total of $n = 100$
locations. We varied the initial sample size $n_0$ between 30 and 90 and considered batch sizes
$b = 1$ (singleton adaptive sampling), 5 and 10.


4.1 Adaptive vs non-adaptive sampling 

For the non-adaptive sampling of each realisation, and for the initial sample in adaptive
sampling, we used an inhibitory design with $\delta = 0.03$. We evaluated each design by its spatially
averaged prediction variance, i.e. APV as defined at (7), in turn averaged over 100 replicate
simulations. When the initial sample size is $n_0 = 30$, the left-hand panel of Figure 1 shows
singleton adaptive sampling to have the lowest APV, achieving a value APV = 0.24. As the
size of the batch increases, APV also increases, but remains substantially lower than the value
APV = 0.33 achieved by non-adaptive sampling.

As the initial size $n_0$ increases towards $n = 100$, the APV for any of the AGDs necessarily
approaches that of the NAGD. For example, the right-hand panel of Figure 1 shows the results
when $n_0 = 50$. The value of APV is approximately 0.33 when $n_0 = 90$ and $b = 10$. For $b = 1$ and 5,
APV generally remains low whilst steadily approaching that of NAGD as $n_0$ increases towards $n$.


[Figure 1 image: average prediction variance under minimum distance batch sampling, for NAGD
and for AGD with b = 1, 5 and 10; panel (a) $n_0 = 30$, panel (b) $n_0 = 50$.]

Figure 1: Non-adaptive (NAGD) vs minimum distance batch adaptive (AGD) sampling, with
$\delta = 0.03$ and AGD batch sizes $b = 1$, 5 and 10. In the left-hand panel, $n_0 = 30$; in the
right-hand panel, $n_0 = 50$. See text for details of the simulation model.


5 Application: rolling malaria indicator surveys for 
malaria prevalence in the Majete perimeter 

In this Section, we illustrate the use of our proposed sampling methodology to construct a
malaria prevalence map for part of an area of the community surrounding Majete wildlife
reserve within Chikwawa district (16° 1' S; 34° 47' E), in the lower Shire valley, southern
Malawi. The Shire river (the biggest river in Malawi) runs throughout the length of Chikwawa
district, causing perennial flooding in the rainy season. Chikwawa is situated in a tropical
climate zone with a mean annual temperature of 26 °C, a single rainy season from November
to April and annual rainfall of approximately 770 mm. The district has extensive rice and
sugar-cane irrigation schemes.

The area surrounding Majete wildlife reserve forms the region for a five-year monitoring and
evaluation study of malaria prevalence, with an embedded randomised trial of community-level
interventions intended to reduce malaria transmission. The whole Majete perimeter is home to
a population of approximately 100,000. Within this population, three distinct administrative
units known as focal areas A, B and C have been selected to form the study region. These are
spread over 61 villages with 6,600 households and a population of approximately 24,500. Here,
we illustrate adaptive sampling design methodology using data from focal area B; see Figure 2.

The first stage in the geostatistical design was a complete enumeration of households in the
entire study region, including their geo-location collected using Global Positioning System
(GPS) devices on a Samsung Galaxy Tab 3 running the Android 4.1 Jelly Bean operating system.
These devices are accurate to within 5 meters. In the on-going rMIS, approximately 90
households are sampled per month per focal area, so that each household will be visited twice
over the two years of the study. Malaria prevalence is highly seasonal. The adaptive design
problem therefore consists of deciding which households to sample in each of the first 12 months
so as to optimise the precision of the resulting sequence of 12 prevalence maps. In year 2, the
sampling design will be re-visited to take account of both statistical considerations and any
practical obstacles encountered during the first year. Here, to illustrate the methodology, we
use data from the first wave of sampling.


5.1 Data 

The initial population-level continuous malaria indicator survey was conducted over the period
April to June 2015. The survey recruited children aged less than 5 years and women of
child-bearing age, 15 to 49 years, in 10 village communities in order to monitor the burden of malaria.
An inhibitory sampling design was used to sample an initial 100 households per focal area.
Households were randomly selected within each village from a list of enumerated households,
whilst ensuring a good spatial coverage of the focal area by requiring that the distance between
any two sampled households is not less than 0.1 kilometres. Figure 2 shows the sampled household
locations (red dots) in their respective villages, with black dots indicating all households in
each village. Data collected from the target population include individual-level outcomes of a
malaria rapid diagnostic test and covariates including age and gender. Household-level covariates
such as socioeconomic status and household location were also collected.

For predictive mapping, any covariates included in the model must be available at all prediction 
locations. We used two digital elevation model (DEM) derivatives, elevation and normalized 
difference vegetation index (NDVI), which are readily available throughout the study region. 
Data for these covariates were derived using the Advanced Space-borne Thermal Emission and 
Reflection Radiometer (ASTER) Global DEM version 2. ASTER GDEM V2 has a spatial 
resolution of 30 meters. The data were downloaded from the United States Geological Survey
(USGS) through their ‘Global Data Explorer’ (http://gdex.cr.usgs.gov/gdex/).


5.2 Results 


We emphasise that at this early stage of the Majete study the data are too sparse for a definitive
prevalence analysis but sufficient to illustrate the adaptive sampling design methodology. The
response from each individual in a sampled household is the binary outcome of a rapid
diagnostic test (RDT) for the presence/absence of malaria from a finger-prick blood sample.
Out of the 100 households in the initial sample, 72 had at least one individual who met the
inclusion criteria (see Section 5.1 above). The total number of eligible individuals in these
72 households was 126, with household size ranging from 1 to 8 individuals. For covariate
selection we used ordinary logistic regression, retaining covariates with nominal p-values less
than 0.05. This resulted in the set of covariates shown in Table 1, with terms for elevation,
NDVI and the interaction between the two. We then fitted the binomial logistic model (1)
to obtain the Monte Carlo maximum likelihood estimates of the parameters and associated
95% confidence intervals also shown in Table 1. Each evaluation of the log-likelihood used
10,000 simulated values, obtained by conditional simulation of 110,000 values and sampling
every 10th realization after discarding a burn-in of 10,000 values.
Figure 2: Households within the Majete wildlife reserve perimeter in focal area B (black dots)
and sampled household locations (white dots) shown in their respective villages.


Term            Estimate    95% confidence interval
Intercept       -5.4827     (-7.6760, -3.2893)
Elevation        0.02651    (0.0162, 0.0368)
NDVI             4.6130     (0.1581, 9.0680)
Elev. × NDVI    -0.0405     (-0.0588, -0.0223)
$\sigma^2$       0.6339     (0.4438, 0.9055)
$\phi$           0.2293     (0.1042, 0.5049)

Table 1: Monte Carlo maximum likelihood estimates and 95% confidence intervals for the
model fitted to the Majete malaria data.


From Table 1, elevation and NDVI show positive marginal associations with malaria, with
a negative interaction. Focal area B is divided through its length by the Shire river. The 
north-east part has relatively high elevation and NDVI values. Prevalence is generally low in 
the south-west of the region, whereas the north-east has pockets of comparatively high malaria 
prevalence. This suggests that heterogeneity in malaria prevalence over focal area B involves 
other risk factors (social or environmental) that are not available in the current data. 
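
The effect of the negative interaction can be made concrete with a small calculation on the point estimates in Table 1. The sketch below treats the estimates as fixed, ignores their uncertainty and the spatial term $S(x)$, and assumes the covariates entered the model on the scales implied by the table; the function names are ours.

```python
BETA_ELEV = 0.02651     # Elevation estimate, Table 1
BETA_NDVI = 4.6130      # NDVI estimate, Table 1
BETA_INT = -0.0405      # Elev. x NDVI estimate, Table 1

def elevation_slope(ndvi):
    """Change in log-odds of infection per unit elevation at a given NDVI."""
    return BETA_ELEV + BETA_INT * ndvi

def ndvi_slope(elevation):
    """Change in log-odds of infection per unit NDVI at a given elevation."""
    return BETA_NDVI + BETA_INT * elevation

# NDVI value at which the fitted elevation slope changes sign.
ndvi_turning_point = -BETA_ELEV / BETA_INT
```

On these estimates the fitted elevation effect is positive at low NDVI and weakens as NDVI increases, so the two covariates should be interpreted jointly rather than one at a time.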

Figure 3 shows the predicted prevalence at each of the observed locations. Households at high
altitude and under dense vegetation cover have generally high malaria prevalence. For this
study, the elevation of households varied from 60 to 460 meters above sea level. Rivers and
streams that are fast flowing in nature are not generally favourable for mosquito larvae; the
Shire river is a big and fast flowing river. Sampling was done at the time of peak malaria
transmission at the end of the rainy season when rains subside. This could potentially explain
the low prevalence in the southern part of the study region. Also, the high prevalence area in
the north-east is generally more remote, with poor access to health facilities.



[Figure 3 image; legend: prevalence classes 0.01 - 0.06, 0.06 - 0.11, 0.11 - 0.16, 0.16 - 0.21
and 0.21 - 0.25.]

Figure 3: Predictions of $d(x)'\beta + S(x)$ at observed locations in focal area B. The blue lines
show the Shire and Matope rivers.


5.3 Adaptive sampling 


We now use the minimum distance batch adaptive sampling approach explained in Section 
3.5| to determine new locations that can and should be added to the existing sample in an 
adaptive manner. We hrst calculate the prediction variance at each household using the data 
from the 72 initial sample locations, shown as red dots in Figure Prediction variances 


range between 0.0003 and 0.0325, and are relatively small at locations closer to the observed 
locations, although this depends on the number of eligible individuals at each location. We 
then choose a sample of 50 additional locations using the algorithm outlined in Section 3 
above. The blue dots in Figure 4a show these 50 new locations determined using the minimum 
distance threshold $\delta = 0.15$ kilometres. The new sampling locations are well spread across 
the study region, which is beneficial for area-wide spatial prediction. Also, although we have 
imposed a minimum distance $\delta$ between any two sampled locations in order to penalise 
highly correlated multiple 
sample locations, the new sample locations nevertheless include some pairs of old and new 
locations in which the new location has been chosen to be relatively close to an initial location 
with high prediction variance; recall that the number of eligible individuals per household 
varied between 1 and 8, hence the prediction variance at a sampled location is itself highly 
variable. As noted earlier, closely spaced pairs are helpful for effective spatial prediction when 
the true model parameters are not known, which is the reality in most geostatistical problems. 
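
The selection step described above can be sketched as a greedy search. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that candidate locations are visited in decreasing order of prediction variance and that a candidate is accepted only if it lies at least $\delta$ from every location already in the design. The function name `min_distance_batch`, its arguments and the toy coordinates are hypothetical.

```python
import numpy as np


def min_distance_batch(candidates, pred_var, sampled, batch_size, delta):
    """Greedy minimum-distance batch selection (illustrative sketch only).

    candidates : (n, 2) candidate coordinates, in km
    pred_var   : (n,) prediction variance at each candidate
    sampled    : (m, 2) coordinates of locations already sampled, in km
    batch_size : maximum number of new locations to add
    delta      : minimum permitted distance (km) between any two locations
    Returns the indices of the chosen candidates, in order of selection.
    """
    candidates = np.asarray(candidates, dtype=float)
    kept = [np.asarray(p, dtype=float) for p in sampled]
    chosen = []
    # Visit candidates from highest to lowest prediction variance.
    for i in np.argsort(-np.asarray(pred_var, dtype=float)):
        if len(chosen) == batch_size:
            break
        # Accept only if at least delta away from every location kept so far
        # (assumption: the constraint applies to old and new locations alike).
        if all(np.linalg.norm(candidates[i] - p) >= delta for p in kept):
            chosen.append(int(i))
            kept.append(candidates[i])
    return chosen


# Toy example: four candidates on a line, one location already sampled.
candidates = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [2.0, 0.0]]
pred_var = [0.030, 0.020, 0.010, 0.005]
sampled = [[2.05, 0.0]]
chosen = min_distance_batch(candidates, pred_var, sampled, batch_size=2, delta=0.15)
print(chosen)  # [0, 2]
```

In the toy example the second candidate has the second-highest prediction variance but is rejected because it lies 0.1 km from the first chosen location, closer than $\delta = 0.15$ km. Fewer than `batch_size` locations are returned if too few candidates satisfy the constraint.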


In Figure 4b we show the prediction variance surface after addition of the 50 adaptively 
sampled locations, for the sub-region highlighted in Figure 4a. Locations with high prediction 
variance are potential candidates for the next round of adaptive sampling, subject to their 
meeting the minimum distance constraint. 


The adaptive sampling design criterion ensures that data are collected only from locations 
that will deliver useful additional information in order to understand the spatial heterogeneity 
throughout the study region. 

Figure 4: (a) Initial inhibitory sampling design locations (red dots) and adaptive sampling 
design locations (blue dots) in focal area B. Inset shows a subset of locations. (b) Prediction 
variance surface for the inset sub-region from 4a. 


6 Discussion 


In any particular application, the objectives of the study can and should inform the design 
strategy. We have developed an adaptive sampling strategy within a model-based geostatistics 
framework for survey-based disease mapping in poor resource settings. The minimum distance 
batch sampling design described in Section 3 is intended to deliver efficient mapping of the 
complete surface, $S(x)$, over the region of interest. Detecting and subsequently evaluating 
sub-regions where prevalence exceeds policy-determined thresholds, which could help guide 
more targeted intervention measures, would require progressive concentration of sampling 
into areas of relatively high prevalence. 

In our application to malaria prevalence mapping, we used an initial set of rMIS data to map 
disease prevalence in focal area B and analysed the resulting data to define a follow-up sample 
of new locations with the aim of reducing as much as possible the average prediction variance. 
The batch size is large because of the high cost in staff and travel time of re-visiting the study 
region more often than monthly. Smaller batch sizes, if feasible, would potentially lead to 
greater gains in efficiency. 

The adaptive sampling design approach is of potentially wide application to disease mapping in 
low resource settings, where accurate registry data typically do not exist. Mapping exercises are 
an important component of any control or elimination programme. Collecting data adaptively 
allows for local identification and targeting of areas with high transmission, incidence or 
prevalence, and an understanding of which household-level and community-level factors influence 
these properties. Knowledge of these properties can inform area-wide health policymaking and 
identify locations of greatest need where interventions that would be considered too costly or 
complicated to implement across an entire population can be targeted in order to optimise 
their public health impact. 

The choice of the initial sampling design $X_0$ is an important step for adaptive sampling. The 
initial sample size, $n_0$, needs to be large enough to allow the fitting of a geostatistical model, 
whose estimated parameter values then drive the adaptive sampling. In the Majete application, 
we prescribed $n_0 = 100$ but, in the event, found eligible study participants in 72 of the sampled 
households. We recommend re-estimation of the model parameters after each batch of locations 
has been added. 


In the Majete application, the irregular spatial distribution of households across the study 
region meant that the set of 122 sampled locations after the first batch of adaptively sampled 
locations had been added to the initial design achieved a good compromise between even 
coverage of the study region and the inclusion of close pairs, which is generally helpful for 
efficient parameter estimation. In other contexts, and specifically where there is essentially no 
restriction on the placement of sampling locations, it would be better to use an initial design 
that deliberately includes some close pairs, as recommended in Diggle and Lophaven (2006). 


In conclusion, the proposed adaptive sampling design provides a systematic approach 
to the collection of exposure and outcome data over time so as to optimise progress towards 
achievement of the analysis objective. Adaptive designs are particularly well suited to spatial 
mapping studies in low resource settings where uniformly precise mapping may be unrealistically 
costly and the priority is often to identify critical areas where interventions can have the 
greatest health impact. Development of adaptive geostatistical design methodology is therefore 
timely for monitoring and evaluating interventions in tropical diseases with high burden such as 
malaria, in areas where accurate disease registries do not exist and resources are severely limited. 
Malaria in particular is a leading cause of death in most of sub-Saharan Africa, especially 
among children under 5 years of age. Malaria monitoring and control programmes can benefit 
from the availability of accurate prevalence maps. Geostatistical analysis in conjunction with 
adaptive sampling is an effective, practical strategy for producing accurate local-scale maps 
that can pick up short-term changes in disease burden and that are complementary to the 
national-scale maps that have been produced, for example, by Hay et al. (2004), Guerra et al. 
(2007), Hay et al. (2009) and Gething et al. (2012). 